Gallego Romero et al. BMC Biology 2014, 12:42 
http://www.biomedcentral.eom/1 741 -7007/1 2/42 



BMC Biology 



RESEARCH ARTICLE Open Access 



RNA-seq: impact of RNA degradation on 
transcript quantification 

Irene Gallego Romero 1 , Athma A Pai 1,2 , Jenny Tung 1,3 and Yoav Gilad 1 * 



Abstract 

Background: The use of low quality RNA samples in whole-genome gene expression profiling remains controversial. 
It is unclear if transcript degradation in low quality RNA samples occurs uniformly, in which case the effects of 
degradation can be corrected via data normalization, or whether different transcripts are degraded at different rates, 
potentially biasing measurements of expression levels. This concern has rendered the use of low quality RNA 
samples in whole-genome expression profiling problematic. Yet, low quality samples (for example, samples 
collected in the course of fieldwork) are at times the sole means of addressing specific questions. 

Results: We sought to quantify the impact of variation in RNA quality on estimates of gene expression levels 
based on RNA-seq data. To do so, we collected expression data from tissue samples that were allowed to decay 
for varying amounts of time prior to RNA extraction. The RNA samples we collected spanned the entire range of 
RNA Integrity Number (RIN) values (a metric commonly used to assess RNA quality). We observed widespread 
effects of RNA quality on measurements of gene expression levels, as well as a slight but significant loss of library 
complexity in more degraded samples. 

Conclusions: While standard normalizations failed to account for the effects of degradation, we found that by 
explicitly controlling for the effects of RIN using a linear model framework we can correct for the majority of these 
effects. We conclude that in instances in which RIN and the effect of interest are not associated, this approach can 
help recover biologically meaningful signals in data from degraded RNA samples. 
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Background 

Degradation of RNA transcripts by the cellular machin- 
ery is a complex and highly regulated process. In live 
cells and tissues, the abundance of mRNA is tightly reg- 
ulated, and transcripts are degraded at different rates by 
various mechanisms [1], partially in relation to their bio- 
logical function [2-5]. In contrast, the fates of RNA tran- 
scripts in dying tissue, and the decay of isolated RNA 
are not part of normal cellular physiology and, therefore, 
are less likely to be tightly regulated. It remains largely 
unclear whether most transcript types decay at similar 
rates under such conditions or whether rates of RNA 
decay in dying tissues are associated with transcript- 
specific properties. 
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These questions are of great importance for studies 
that rely on sample collection in the field or in clinical 
settings (both from human populations as well as from 
other species), in which tissue samples often cannot im- 
mediately be stored in conditions that prevent RNA deg- 
radation. In these settings, extracted RNA is often partly 
degraded and may not faithfully represent in vivo gene 
expression levels. Sample storage in stabilizers like RNA- 
Later lessens this problem [6] but is not always feasible. 
Differences in RNA quality and sample handling could, 
therefore, confound subsequent analyses, especially if 
samples subjected to different amounts of degradation 
are naively compared against each other. The degree to 
which this confounder affects estimates of gene expres- 
sion levels is not well understood. 

There is also no consensus on the level of RNA decay 
that renders a sample unusable or on approaches to con- 
trol for the effect of ex vivo processes in the analysis of 
gene expression data. Thus, while standardized RNA 
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quality metrics such as the Degradometer [7] or the 
RNA Integrity Number (RIN; [8]), provide well-defined 
empirical methods to assess and compare sample qual- 
ity, there is no widely accepted criterion for sample in- 
clusion. For example, proposed thresholds for sample 
inclusion have varied between RIN values as high as 8 
[9] and as low as 3.95 [10]. The recent Genotype-Tissue 
Expression (GTEx) project [11], for instance, reports 
both the number of total RNA samples they collected as 
well as the number of RNA samples with RIN scores 
higher than 6, presumably as a measure of the number 
of high quality samples in the study. 

Broadly speaking, three approaches can be adopted to 
deal with RNA samples of variable quality. First, RNA 
samples with evidence of substantial degradation can be 
excluded from further study; this approach relies on es- 
tablishing a cut-off value for 'high quality versus low 
quality' samples and suffers from the current lack of 
consensus on what this cut-off should be. It also could 
exclude the possibility of utilizing unique and difficult to 
collect samples from remote locations or historical col- 
lections. Second, if investigators are willing to assume 
that all transcript types decay at a similar rate, variation 
in gene expression estimates due to differences in RNA 
integrity could be accounted for by applying standard 
normalization procedures. Third, if different transcripts 
decay at different rates, and if these rates are consistent 
across samples for a given level of RNA degradation - 
for example, a given RIN value - a model that explicitly 
incorporates measured, sample-specific, degradation levels 
could be applied to gene expression data to correct for the 
confounding effects of degradation. 

To date, most studies apply a combination of the first 
two approaches: an application of an arbitrary RNA 
quality cutoff (typically based on RIN score), followed by 
standard normalization of the data, which assumes that 
RNA samples at any RIN value higher than the chosen 
cutoff are not subjected to transcript-specific decay 
rates. However, current work on the effects of RNA 
decay has not yet provided clear guidelines with respect 
to these approaches. In addition, nearly all published 
work that focuses on RNA stability in tissues following 
cell death and/or sample isolation predates, or does not 
employ, high throughput sequencing technologies. These 
studies broadly suggest that both the quantity and qual- 
ity of recovered RNA from tissues can be affected by 
acute pre-mortem stressors, such as pyrexia or pro- 
longed hypoxia [12-14], and by the time to sample pres- 
ervation and RNA extraction. The quantity and quality 
of recovered RNA are strongly dependent on the type of 
tissue studied [15], even when sampling from the same 
individual [16,17]. These differences in yield across tis- 
sues have resulted in a wide range of recommendations 
for an acceptable post-mortem interval for extracting 



usable, high-quality RNA, ranging from as little as 10 mi- 
nutes [18] to upwards of 48 hours [19], depending on 
tissue source and preservation conditions. 

Similarly, studies examining changes in the relative 
abundance of specific transcripts as a result of ex vivo 
RNA decay have reached somewhat contradictory recom- 
mendations. Some of this conflict may be attributable to 
methodological differences. Studies that focused on small 
numbers of genes assayed through quantitative PCR con- 
sistently report little to no effect of variation in RNA qual- 
ity on gene expression estimates [6,19-22], Conversely, 
microarray-based studies have repeatedly reported signifi- 
cant effects of variation of RNA quality on gene expression 
estimates, even after applying standard normalization ap- 
proaches. Increasing the time from tissue harvesting to 
RNA extraction or cryopreservation from 0 to only 40 or 
60 minutes, for example, significantly affected expression 
profiles in roughly 70% of surveyed genes in an experiment 
on human colon cancer tissues [20]. Likewise, a substantial 
fraction of genes in peripheral blood mononuclear cells 
(PBMCs) appears to be sensitive to ex vivo incubation 
[21]. Other microarray-based studies have reached similar 
conclusions, both in samples from humans [15,16,22,23] 
and other organisms [24], and have urged caution when 
analyzing RNA samples with medium or low RIN scores, 
although the definition of an acceptable RNA quality 
threshold remains elusive. 

To examine the effects of RNA degradation in a set- 
ting relevant to field study sample collection, we se- 
quenced RNA extracted from PBMC samples that were 
stored unprocessed at room temperature for different 
time periods, up to 84 hours. We collected RNA decay 
time-course data spanning almost the entire RIN quality 
scale and examined relative gene-specific degradation 
rates through RNA sequencing. Due to the high sensitiv- 
ity and resolution of high-throughput RNA sequencing, 
our data provide an unprecedentedly detailed picture of 
the dynamics of RNA degradation in stressed, ex vivo 
cells. Based on our results, we develop specific recom- 
mendations for accounting for these effects in gene ex- 
pression studies. 

Results 

We extracted RNA from 32 aliquots of PBMC samples 
from four individuals. The PBMC samples were stored 
at room temperature for 0 hours, 12 hours, 24 hours, 
36 hours, 48 hours, 60 hours, 72 hours and 84 hours prior 
to RNA extraction. As expected, time to extraction sig- 
nificantly affected the RNA quality (P <10" n ), with mean 
RIN = 9.3 at 0 hours and 3.8 at 84 hours [see Additional 
file 1: Table SI]. Based on the RIN values we chose to focus 
on 20 samples from five time points (0 hours, 12 hours, 
24 hours, 48 hours and 84 hours) that spanned the entire 
scale of RNA quality. We generated poly-A-enriched RNA 
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sequencing libraries from the 20 samples using a standard 
RNA sequencing library preparation protocol (see [25]). 
We added a spike-in of non-human control RNA to each 
sample, which allowed us to confirm the effects of RNA 
degradation on the RNA sequencing results (see Methods 
for more details). Following sequencing, we randomly sub- 
sampled all libraries to a depth of 12,129,475 reads, the 
lowest number of reads/library observed in the data. We 
used BWA 0.6.3 to map reads, calculated reads per kilo- 
base transcript per million (RPKM), and normalized the 
data using a standard quantile normalization approach (for 
example, as in [26]). We observed that sample RIN is 
associated with both the number of uniquely mapped 
reads (analysis of variance (ANOVA) P <10" 3 ) and the 
number of reads mapped to genes (P <1(T 3 ; Additional 
file 2: Figure SI), with high RIN samples having greater 
numbers of both. Furthermore, the proportion of exogen- 
ous spike-in reads increases significantly as RIN decreases 
(P <10" 10 ), as expected given degradation-driven loss 
of intact human transcripts in poor quality samples. Se- 
quence reads from individual 2 were poorly mapped, espe- 
cially in the later time-points (see Methods and Additional 
file 2: Figure SI); we thus excluded the data from all sam- 
ples from this individual in subsequent analysis. 

The effect of RNA degradation on RNAseq output 

Principal component analysis of our data demonstrates 
that much of the variation (28.9%) in gene expression 
levels in our study is strongly associated with RNA sam- 
ple RIN scores (Figure 1A; principal component 1 (PCI) 



associated with RIN scores P <10~ 7 ; no other PCs are 
significantly associated with either sample storage time 
or RIN score; Additional file 3: Table S2). We also ob- 
served a residual presence of inter- individual variation in 
the data, in spite of variable RNA quality (PCs 4 and 5; 
Additional file 4: Figure S2 and Additional file 3: Table 
S2). A correlation matrix based on the gene expression 
data (Figure IB) indicates that while samples of relatively 
high quality RNA cluster by individual, data from RNA 
samples that experienced high yet similar degradation 
levels are more correlated than data from samples from 
the same individual across the time-points. This pattern 
contrasts with the naive expectation that gene expres- 
sion differences between individuals should be the 
strongest signal in the normalized data. Instead, inter- 
individual differences only predominate in the early stages 
of degradation, at the early time-points of 0 hours (mean 
RIN = 9.3) and 12 hours (mean RIN = 7.9). These observa- 
tions are robust with respect to the approach used to esti- 
mate gene expression levels and - importantly - are not 
explained by unequal rates of degradation occurring at dif- 
ferent distances from the 3' poly- A tail. For example, we 
found nearly identical patterns when we estimated expres- 
sion levels based only on reads that map to the 1,000 bp at 
the 3 ' end of each gene (Additional file 5: Figure S3). Simi- 
larly, these effects are robust to the choice of mapping al- 
gorithm. Because BWA does not map reads across exon 
splice junctions, we also remapped our data (excluding in- 
dividual 2) using TopHat 2.0.8 [27]. As expected, we found 
a high correlation between RPKM estimates based on 




PC1 (28.9% of variance) 

Figure 1 Broad effects of RNA degradation. A) PCA plot of the 15 samples included in the study based on data from 29,156 genes with at 
least one mapped read in a single individual. Different colors identify different time-points, while each shape indicates a particular individual in 
the data set. B) Spearman correlation plot of the 15 samples in the study. PCA, principal component analysis. 
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alignments with both approaches (Spearman's p = 0.82 
when we consider all genes with at least one observation 
of RPKM > =0.3 in the entire data set; Spearman's p = 0.85 
when we only consider genes with at least one observation 
of RPKM > =0.3 using data mapped with BWA, Additional 
file 6: Figure S4 and Additional file 7: Figure S5). Finally, 
we found that the global effects of RNA degradation on 
estimated gene expression levels could not be elimi- 
nated by globally regressing out RIN scores [see Additional 
file 8: Figure S6]. 

The possibility of reduced sequencing library complexity 
is often cited as a reason to exclude RNA samples of low 
quality. This concern is primarily based on the observation 
that sequencing RNA samples of lower RNA quality results 
in relatively decreased proportions of mappable reads, an 
observation corroborated in our study [see Additional 
file 2: Figure SI]. Yet, it is unclear to what extent this 
property affects the ability to estimate gene expression 
levels in RNA samples of low quality. To assess the effects 
of RIN on sample complexity, we plotted the distribution 



of RPKM values within individuals at different time points. 
Our data indicate that mean RPKM increases as sample 
RIN decreases (P <1(T 5 Additional file 9: Figure S7). This 
seems counterintuitive, but can be explained by the pres- 
ence of a few highly expressed genes in the samples of low 
RNA quality. Indeed, relative to 0 hours, low RIN samples 
at 48 hours and 84 hours have an excess of low RPKM 
genes and a deficit of high RPKM genes, shifting the me- 
dian RPKM downwards (P <1(T 4 ; Figure 2). We further 
found a positive association between the number of genes 
with at least one observation of RPKM >0.3 and RIN 
(P <10" 4 ). Even when we subsampled all samples to the 
same number of sequencing reads, we still observed a high 
proportion of genes with low RPKM values in RNA sam- 
ples of lower quality (P <1(T 4 ; Additional file 10: Figure S8). 
This suggests that a non-uniform effect of RNA degrad- 
ation on gene expression levels results in somewhat lower 
complexity of the sequencing library (Figure 2, Additional 
file 10: Figure S8). On the other hand, both within a single 
individual and across the whole dataset, we found that 
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log 10 RPKM log 10 RPKM 
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log 10 RPKM log 10 RPKM 

Figure 2 Changes in library complexity over time. Dashed lines indicate median RPKM at each time-point. A) Density plots of RPKM values 
among all three individuals at 0 hours and 12 hours. B) as A, but 0 hours and 24 hours. C) as A, but 0 hours and 48 hours. D) as A, but 0 hours 
and 84 hours. RPKM, reads per kilobase transcript per million. 
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nearly all genes whose expression could be measured at 
0 hours are also detected as expressed throughout the en- 
tire time-course experiment. Only a few genes (Table 1) 
present in all individuals up until a given time point were 
completely absent from the data at later time points. 

Different transcripts are degraded at different rates 

We sought to understand better the nature of transcript 
degradation in the RNA samples of lower quality. Given 
our time course study design, we were able to estimate 
degradation rates for all genes detected as expressed at all 
five time-points. To do so, we fit a log-normal transform 
of a simple exponential decay function (see Methods) to 
quantile-normalized RPKM values for each gene that was 
detected as expressed in all individuals at all time-points. 
We considered the slope of this function, /<, to be a proxy 
for the decay rate of the gene. We then compared this 
slope to the mean transcript degradation rate across all 
genes, which, as a result of our quantile normalization ap- 
proach, is equal to 0 (thus, a value of 0 indicates no change 
in the relative rank of that transcripts expression level 
across time points). If all genes decay at the same rate, then 
no slopes should significantly differ from the mean value. 
However, at a false discovery rate (FDR) threshold of 1%, 
we found that 7,267 of the 11,923 genes tested (60.95%; see 
Methods) were associated with degradation rates that were 
significantly different from the mean (Figure 3; Additional 
file 11: Table S3). Of these genes, 3,522 had a negative 
slope (that is, they were degraded significantly faster than 
the mean degradation rate) and 3,745 had a positive 
slope (that is, these transcripts were degraded signifi- 
cantly slower than the mean degradation rate). 

Although we might expect RNA degradation in decaying 
cells to be a random process, gene ontology (GO) analysis 
identified 118 and 293 significantly overrepresented cat- 
egories among slowly and rapidly degraded genes, re- 
spectively (FDR = 5%; Additional file 12: Tables S4 and 
Additional file 13: Table S5). We present the functional 
enrichment results only as an indication that the rate of 
transcript decay is not random. These observations are 
robust to different normalization approaches, to the in- 
clusion of RIN as a covariate in our linear model, and to 
fitting slopes using RIN instead of time-points. Limiting 
our analyses to the 1,000 bp closest to the 3' end of 
transcripts also yields similar results. 



We asked what properties, beyond GO functional cat- 
egories, might be associated with the observed variation 
in transcript degradation rates. We found that the cod- 
ing DNA sequence (CDS) length (P <10" 12 ), %GC content 
(P <10" 4 ), and 3 'UTR length (P <10" 15 ) are all significantly 
correlated with estimated transcript degradation rate 
(Figure 4A-C), with higher %GC content and increased 
length of both the 3 ' UTR and CDS all associated with fas- 
ter degradation. However, we found that total transcript 
length (5 ' UTR + CDS + 3 ' UTR) is not significantly corre- 
lated with degradation rates; instead, targets of both fast and 
slow degradation have longer transcripts than those that are 
degraded at an average rate (Figure 4D). The correlation 
between %GC content and CDS length is high (p = -0.19, 
P <1(T 16 ), but even when we control for the effects of either 
variable, the individual effects remain significant predictors 
of degradation rates (P <10~ 7 ). Our data thus suggest that 
both CDS length and %GC content affect degradation rate, 
and that observed degradation rates result from complex in- 
teractions between multiple forces. We again present these 
results as evidence for the non-random nature of the tran- 
script degradation rate (yet, we do not presume at this time 
to offer mechanistic explanations for these correlations). 

We also sought to investigate whether targets of slow, 
fast, or average degradation differ meaningfully in terms 
of broad biological categories. As expected given our 
poly-A enrichment strategy, most transcripts in our data 
originate from intact protein-coding genes, but we also 
observed four other biotypes represented by more than 
100 distinct transcripts. The distribution of these bio- 
types across rapidly and slowly degraded transcripts is 
not random, with a significant enrichment of pseudo- 
genes among transcripts that degrade slowly (P = 0.015), 
and an enrichment of intact protein-coding genes among 
the rapidly degraded transcripts (P <1(T 16 , Figure 4E). 

Controlling for the effect of RNA degradation in analyses 
of differential expression 

Ultimately, the goal of most RNA sequencing studies is 
to estimate variation in gene expression levels or to 
identify genes that are differentially expressed between 
conditions, individuals or states. We thus considered 
the effects of RNA quality on measures of relative gene 
expression levels between time-points and on overall es- 
timates of inter-individual variation in gene expression. 



Table 1 Genes observed in all individuals until or after a particular time point 

Seen until 0 hours 12 hours 24 hours 48 hours 84 hours 

#genes 14 9 72 52 11,923 

Mean RPKM when seen 0.68 0.679 1.29 1.09 32.689 

Unseen before 0 hours 12 hours 24 hours 48 hours 84 hours 

#genes n/a 4 2 19 35 

Mean RPKM when seen n/a 1.078 2.212 2.769 3.034 
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Figure 3 Log 10 median abundance of genes across all three individuals relative to 0 hours. Plots are separated by slope. A) Transcripts 
with significantly slow rates of degradation relative to the mean rate (identified at 1% FDR, n = 3,745). B) Transcripts that are degraded at a rate 
close to the mean cellular rate (n =4,656). C) Transcripts with significantly fast rates of degradation relative to the mean rate (identified at 1% 
FDR, n = 3,522). In all plots, the thick dashed line indicates the median degradation rate for all genes in that group, whereas the thin dashed line 
denotes no change in degradation rate relative to 0 hours. FDR, false discovery rate. 



As a first step we analyzed the normalized expression 
data using a generalized linear model (GLM) approach (see 
Methods) to classify genes as differentially expressed be- 
tween 0 hours and any other time-point. We only consid- 
ered genes with at least one mapped read in all individuals 
at all time-points (n = 14,094). At an FDR of 5%, we identi- 
fied 608 (4%) genes as differentially expressed by 12 hours. 
Both the number of differentially expressed genes and 
the magnitude of expression changes increase drastically 
along the time-course experiment (Table 2). By 84 hours, 
9,998 genes (71%) are differentially expressed (FDR = 5%). 
Roughly half of these genes appear to be more highly 
expressed in the later time-points than at 0 hours. This 
may seem counterintuitive given that the change in expres- 
sion is most likely the result of RNA degradation, yet this 
apparent increase in expression is due to our normalization 
approach (all transcripts in our experiment experience 
some level of degradation throughout the time course). 



Post normalization of the data, an apparent elevated ex- 
pression level in the later time points, therefore, indicates 
slow degradation relative to the genome-wide mean rate of 
RNA decay. 

As expected, when we include RIN as a covariate in the 
model the number of differentially expressed genes across 
time-points is drastically reduced (fewer than 50 genes are 
classified as differentially expressed between 0 hours and 
any other time-point; Table 2). These observations confirm 
that RIN is a robust indicator of degradation levels. With- 
out accounting for RIN, the effect of variation in RNA 
quality on our data is overwhelming. To understand these 
effects better, we explored whether accounting for variation 
in RIN increased the power to detect other sources of (bio- 
logically relevant) variation in RNA-seq data, such as the 
variation in gene expression between individuals. We also 
investigated several alternative approaches for controlling 
for variation in RNA quality. 
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Figure 4 Characteristics of rapidly and slowly degraded transcripts. In all plots, rapidly degraded transcripts are plotted in gold, transcripts 
degraded at an average rate are plotted in grey and slowly degraded transcripts are in red. A) By transcript %GC content. B) By coding region 
length. C) By 3'UTR length. D) By complete transcript length. E) By ENSEMBL biotype. 



Without accounting for RIN, we classified few genes (48 to 
100; Table 2) as differentially expressed between pairs of indi- 
viduals. This property of the data is captured by a heat map 
of sample pairwise correlation calculated using only the top 
10% (1,410) most variable genes across individuals at 0 hours. 
As can be seen in Figure 5A, while at the early time-points 
inter-individual differences are the predominant source of 
variation in the data, degradation overwhelms these differ- 
ences in the low quality (low RIN) RNA samples from 



48 hours and 84 hours. Hence, inclusion of these time points 
in our GLM, which considers samples from the same indi- 
vidual but different time points as 'technical replicates,' ob- 
scures much of the true signal of inter-individual variability. 

To recover this signal, we tested two approaches for 
explicitly accounting for RIN when estimating differen- 
tial gene expression across individuals: (1) incorporating 
RIN as a covariate in our GLM; and (2) analyzing the re- 
siduals of gene expression levels after first regressing out 



Table 2 Number of identified DE 


genes 










GLM: reads approximate time point 




Time point 


GLM 


GLM + RIN 


Regress RIN, GLM 


0 h versus 12 h 


608 


5 


26 


0 h versus 24 h 


3,704 


5 


203 


0 h versus 48 h 


8,756 


47 


5 


0 h versus 84 h 


9,998 


42 


0 






GLM: reads approximate individual b 




Individuals 


GLM 


GLM + RIN 


Regress RIN, GLM 


Ind 1 versus Ind 3 


69 


553 


268 


Ind 1 versus Ind 4 


48 


401 


190 


Ind 3 versus Ind 4 


100 


573 


299 


b, Treating time as technical replicates; DE, Differentially expressed; GLM, generalized linear model; H, hours; RIN, RNA integrity number. 



Gallego Romero et al. BMC Biology 2014, 12:42 
http://www.biomedcentral.eom/1 741 -7007/1 2/42 



Page 8 of 13 



RIN from the normalized gene expression data (Table 2). 
Both approaches result in the identification of many more 
genes as differentially expressed between individuals (401 
to 573 when incorporating RIN directly into our GLM, 
190 to 299 when testing for differential expression using 
residuals; Table 2). We also repeated the pairwise correl- 
ation analysis using the same 1,410 most variable genes 
identified above, but this time we used the residuals after 
regressing the effect of RIN from the data. The residuals 
cluster well by individual throughout the entire time 
course experiment, regardless of RNA quality (Figure 5B). 

Finally, we examined the overlap between the subset 
of the 10% of most variable genes across individuals at 
0 hours (the 1,410 genes used to generate Figure 5) and 
those identified as differentially expressed across individ- 
uals as described above (Table 3). Of the two approaches 
we employed to account for the effect of RIN, testing for 
differential expression after removing the effects of RIN 
on the data (method 2) yielded higher concordance be- 
tween DE genes and those with high inter- individual 
variance at 0 hours, suggesting it may be a better ap- 
proach than simply including RIN as a covariate. 

Discussion 

Our observations indicate that the effects of RNA deg- 
radation following death or tissue isolation are pervasive 
and can rapidly obscure inter- individual differences in 
gene expression. Yet, we also found that by using RNA- 
seq nearly all genes observed at our first time-point 
could still be detected even in severely degraded RNA 



samples, although the estimated relative expression levels 
were drastically affected by degradation. Although post- 
mortem RNA degradation is considered a non-regulated 
process, some of the traditional predictors of regulated 
RNA decay rates in the cell are also associated with vari- 
ation in RNA quality in our data. For example, longer pro- 
tein coding regions and 3 ' UTRs are correlated with more 
rapid degradation, similar to previously reported trends 
[5,28,29]. Total transcript length, however, which is a 
significant predictor of regulated RNA decay in the cell, 
is not associated with variation in degradation rates in 
our data. 

The effect of RNA degradation can be accounted for 

We confirmed previous observations of decreasing data 
quality as time from tissue extraction to RNA isolation in- 
creased [see Additional file 2: Figure SI], both with respect 
to the number of high quality reads we were able to gener- 
ate from our sequencing libraries and library complexity. 
While increased time to RNA extraction did not generally 
result in the complete loss of transcripts (less than 8% of 
transcripts are lost), the relative expression levels of many 
transcripts were drastically altered over the time-course 
experiment, with 61% of genes classified as differentially 
expressed between 0 hours (mean RIN of 9.3) and 84 hours 
(mean RIN of 3.78). This proportion of differentially 
expressed genes is in line with previous reports of the ef- 
fects of warm ischemia on human gene expression in 
tumor biopsies, as assessed using microarrays [20,22]. The 
potential of RNA degradation to skew measurements of 
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Table 3 DE genes across pairs of individuals and overlap with top 10% most variable genes at 0 hours 

GLM individual GLM, individual + RIN Regress RIN, GLM individual 



Test Number DE genes % overlap Number DE genes % overlap Number DE genes % overlap 



Ind 1 vs Ind 3 


69 


86.96% 


553 


45.39% 


268 


75.00% 


Ind 1 vs Ind 4 


48 


89.58% 


401 


50.12% 


190 


78.95% 


Ind 3 vs Ind 4 


100 


87.00% 


573 


49.21% 


299 


73.91% 


All individuals 


160 


85.00% 


1053 


42.64% 


521 


71.98% 



DE, Differentially expressed; GLM, generalized linear model; RIN, RNA integrity number. 



gene expression levels and obscure biologically meaningful 
signals is, therefore, apparent. If there are systematic dif- 
ferences in RNA quality between two classes of samples 
being compared, we predict that the effect of RNA quality 
on relative estimates of gene expression levels would be 
responsible for much of the signal in the data. Further- 
more, as degradation rate is to some degree associated 
with biological function [see Additional file 11: Tables S3 
and Additional file 12: Table S4], it has the potential to 
confound naive comparisons of functional annotations 
as well. 

However, the marked effects of RNA degradation on 
the relative expression level of most genes can be cor- 
rected, to a large degree, using relatively simple statis- 
tical methods. Indeed, we found that the inclusion of 
RIN in our model was sufficient to account for much of 
the effect of degradation and allowed us to identify a 
reasonable number of differentially expressed genes be- 
tween pairs of individuals in our data. These were not 
spurious signals generated by our approach; they recapit- 
ulated observations made at 0 hours (when RNA quality 
was excellent), but were originally dwarfed by the mag- 
nitude of degradation-driven expression changes in the 
uncorrected data. A similar approach - taking into ac- 
count variation in RIN - has been previously proposed 
for the analysis of RTq-PCR data abundance [30]. Never- 
theless, our observations suggest that some of the effects 
of transcriptional degradation in ex vivo samples cannot 
be corrected. Library complexity decreases somewhat 
with lower RNA quality, and some genes (approximately 
5%) can no longer be detected at the later time-points. 
Based on our data we conclude that these effects cannot 
be corrected by simply sequencing more degraded librar- 
ies to a greater depth. 

In a study similar to our own, Opitz et al. [31] sub- 
jected extracted RNA samples from three advanced hu- 
man rectal cancer biopsies to degradation through 
increasingly longer incubation at 60 °C and then consid- 
ered the evidence of time-point/RIN-driven degradation 
using microarray data. The RIN values spanned by their 
data mirror values in ours, but the results do not. In 
contrast to the large RIN-associated effects we observed, 
Opitz et al. reported that of 41,000 tested probe-level 



2data points only 275 demonstrated significant degrad- 
ation effects, with inter-individual differences being the 
predominant signal in the data. Assuming that differ- 
ences in the platforms used (microarrays and RNAseq) 
are not the reason for this discrepancy, one possible ex- 
planation for this stark difference between the studies is 
that lower RIN scores as a result of degradation of ex- 
tracted RNA samples (Opitz et al.) may reflect substan- 
tially different properties than lower RIN scores that are 
the result of degradation of RNA in decaying cells (our 
study). Based on the observations of Opitz et al. we 
hypothesize that degradation rates of isolated RNA may 
be mostly linear and uniform; thus, the degradation ef- 
fects can be accounted for by employing standard 
normalization approaches. In contrast, degradation rates of 
RNA in a dying tissue sample, a situation that mirrors 
more closely conditions likely to be faced by investigators 
in clinical or field settings, is not uniform across tran- 
scripts. Because these differences cannot be neglected in 
downstream analyses, knowledge of the context in which 
degradation occurs is, therefore, crucial. 

Our observations suggest that actively mediated degrad- 
ation of transcripts may occur during necrosis; namely, 
degradation of RNA in a dying tissue may not be a com- 
pletely random process. Biologically mediated degradation, 
whether actively driven by the cells decay machinery [1], 
or simply the consequence of the leakage of RNases into 
cells as membranes are disrupted, is different from the 
heat-driven degradation of naked RNA, which in turn is 
likely to be different from the degradation caused by con- 
tinued freeze-thaw cycles [32]. It is likely that in a dying 
tissue, most degradation is initially biologically mediated 
and directed towards specific classes of transcripts, but as 
the cellular environment continues to deteriorate, the rela- 
tive importance of stochastic degradation may increase 
such that at later time-points degradation becomes in- 
creasingly uncoupled from biological function. 

Additionally, the increased resolution of RNA sequen- 
cing relative to other platforms used to assay gene ex- 
pression levels [25] is both a hindrance and a boon in 
this situation, allowing for detection of subtler differ- 
ences than ever before, but also warranting greater cau- 
tion when analyzing samples of differing quality. 
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Recommendation regarding the inclusion of RNA samples 
in a study 

Previous studies [8-10,23,32-34] have sought to provide 
an RNA degradation threshold below which in-depth 
analysis of RNA is not recommended. However, these 
studies have reached conflicting conclusions. Our data 
suggest that if a simple cut-off value is to be used, a con- 
servative cut-off in the context of RNA degradation in 
dying tissue samples lies between 7.9 and 6.4, the mean 
RIN scores associated with 12 hours and 24 hours in our 
time course experiment, respectively. We observed few 
differences in measurements of gene expression between 
0 hours and 12 hours, as evidenced by the low number 
of genes identified as differentially expressed between 
the two time-points. Thus, it may be tempting to con- 
clude that so long as all samples in any particular study 
have roughly similar RINs explicit correction is not ne- 
cessary. However, when we test for differential expres- 
sion between other close time-points we identify 3,020 
genes as differentially expressed between 48 hours and 
84 hours (difference in mean RIN = 1.3), and 5,293 
between 24 hours and 48 hours (difference in mean 
RIN = 1). It is clear that measurements of gene expres- 
sion are extremely sensitive to starting sample quality. 

Conclusions 

Our observations indicate that useful data can be col- 
lected using RNA sequencing even from highly degraded 
samples. As long as RIN scores are not associated with 
the effect of interest in the study (namely, different clas- 
ses of samples in the study are not associated with dif- 
ferent distributions of RIN scores), accounting for RIN 
scores explicitly can be an effective approach. In our 
study, we were able to identify differently expressed 
genes between individuals even when RNA samples with 
RIN scores around 4 were included. Excluding the sam- 
ples with RIN values lower than 6.4 in our study would 
have resulted in a less powerful design than including 
these samples and globally correcting for RIN values. 
Given these results, we believe that under most circum- 
stances, the most effective approach may be to include 
all samples regardless of quality, and explicitly model a 
measure of RNA quality in the analysis. 

Methods 

RNA degradation 

We obtained Buffy coat samples from four adult Caucasian 
males from Research Blood Components LLC (Boston, 
MA, USA) and separated PBMCs through a standard 
Ficoll gradient purification. Each sample was split into ali- 
quots of four million live cells and resuspended in 200 uL 
of PBS. Cells were kept at room temperature and aliquots 
from each sample lysed every twelve hours by addition of 
700 uL of RLT buffer (Qiagen, Valencia, CA, USA) with 



beta-mercaptoethanol ( Sigma- Aldrich, St Louis, MO, 
USA) added at 10 uL BME/1 mL RLT according to the 
manufacturers instructions. Lysed cells were immediately 
frozen and not thawed until RNA extraction. 

Extraction and sequencing 

RNA was extracted using the Qiagen RNeasy kit. Extracted 
RNA quality was assessed with a BioAnalyzer (Agilent 
Technologies, Wilmington, DE, USA). From these results 
we selected five time-points - 0 hours, 12 hours, 24 hours, 
48 hours and 84 hours - that encompassed a large stretch 
of the degradation spectrum. We then prepared poly-A- 
enriched RNA sequencing libraries for all 20 individual/ 
time-point combinations according to a previously pub- 
lished protocol [25], using 1.5 ug of total RNA per library 
in all instances. In all instances, we added 15 ng (1%) of an 
exogenous RNA spike-in during library preparation, com- 
posed of equal parts Caenorhabditis elegans, Drosophila 
melanogaster and Danio rerio total RNA. Samples were 
multiplexed and sequenced on four lanes (two per library 
preparation strategy) of an Illumina HiSeq2000 using 
standard protocols and reagents. Reads were 50 bp in 
length. All generated reads have been deposited into the 
Sequence Read Archive (SRA) under accession numbers 
SAMN02769865-SAMN02769884. 

Data mapping and normalization 

Data were combined across lanes and data for all libraries 
were randomly subsampled to the lowest observed number 
of reads, 12,129,475. Reads were independently mapped 
to the human (hgl9), D. rerio (danRer7), D. melanogaster 
(dm3), and C. elegans (celO) genomes using BWA 0.6.2 
[35]. All reference genomes were obtained from the UCSC 
Genome Browser [36] . Only reads that mapped exclusively 
to a single site in the human genome with one or zero mis- 
matches were retained for downstream analyses. Following 
mapping, we removed all reads that mapped to more than 
one genome. At this point we also discarded one individ- 
ual - individual number 2 - due to low mappability and 
read quality in the later time points [see Additional file 2: 
Figure SI]. We also mapped all reads using TopHat 2.0.8 
and the same quality thresholds and filtering steps. 

We calculated RPKM [37] for all ENSEMBL v71 [38] hu- 
man genes in our data. Genes with multiple transcripts 
were collapsed into a single transcript containing all exons 
of the gene; where multiple exons of different size over- 
lapped the same genomic region, the entire region was 
kept. We discarded all exonic regions transcribed as part of 
more than one gene. Additionally, we quantile-normalized 
both RPKM and read count-level data across individuals 
using the lumiN function in the Bioconductor [39] package 
lumi [40], which controls for, and dampens, technical 
sampling variance in highly expressed genes. Read counts 
were log2 transformed prior to quantile normalization to 
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generate a normal distribution; analyses were carried out 
on subsequently untransformed counts. 
All statistical analyses were carried out using R 2.15.2. 

Calculation of decay rates 

We estimated the decay rate of the 11,923 genes with an 
RPKM >0.3 in all individuals at all time-points by fitting 
a first order log-normal transform of the classical first- 
order decay equation: 

ln(y(f)) =B 0 -kt + e 

where y(t) is the mRNA abundance of a given gene at 
time t (in quantile-normalized RPKM), B 0 is the abun- 
dance at the initial time-point, and k the decay rate, with 
the variance term e being normally distributed. Data 
from all three individuals were considered simultan- 
eously; that is, we obtained a single decay constant for 
each gene across all three individuals. To control for the 
high FDR of expressed genes at low expression levels, all 
RPKM observations <0.3 were discarded for all subse- 
quent analyses, as in [41]. 

Length and per- transcript %GC content were calcu- 
lated using BEDTools (version 2.16.2 [42]), using the 
same gene models described above. Biotype as well as 5 ' 
and 3' UTR length were retrieved from ENSEMBL v71. 
In those instances where there are multiple UTRs asso- 
ciated with the same gene, we used the median UTR 
length for each gene in all calculations. 

Differential expression and gene enrichment 

Differentially expressed genes were identified using the 
R package edgeR [43], utilizing a GLM framework with 
time, individual ID and sample RIN as covariates, as 
described above. Only those genes with a minimum ob- 
servation of one mapped read across all individuals at 
all time-points were included. Instead of quantile nor- 
malization as described above, all data were normalized 
using trimmed mean of M values normalization (TMM, 
[44]), which corrects for the observed differences in 
informative reads between sequencing libraries. Inter- 
individual variance estimates were generated after variance 
stabilization of read counts using the predFC function in 
edgeR. 

Downstream gene enrichment analyses were carried 
out using the R package topGO [45], using the classic' 
algorithm and a minimum node size of five. All signifi- 
cance values given in the text have been corrected to an 
FDR of 5% or 1%, using the qvalue method of [46]. In all 
cases, the background data set included all 14,094 genes 
with complete observations. 



Additional files 

Additional file 1: Table SI. Relationship between RIN and time to RNA 
extraction. 

Additional file 2: Figure SI. Fraction of reads mapped from generated 
libraries. All samples were randomly subset to the same depth prior to 
mapping. 

Additional file 3: Table S2. Correlations between PCs and covariates. 

Additional file 4: Figure S2. PCA plot of principal components 4 and 
5, the only components significantly associated with inter-individual 
variation in the data. Different colors identify different time-points, while 
each shape indicates a particular individual in the data set. 

Additional file 5: Figure S3. A) PCA plot of the 15 samples included in 
the study based on data from 27,856 genes with at least one mapped 
read to the 1,000-most 3' base pairs in a single individual. Different colors 
identify different time-points, while each shape indicates a particular 
individual in the data set. B) Spearman correlation plot of the 15 samples 
in the study, using only data trimmed to the 1,000-most 3' bp. 

Additional file 6: Figure S4. Density plot of RPKM estimates per gene 
after mapping with BWA and TopHat. Only genes with an RPKM > =0.3 
after mapping with BWA are shown. 

Additional file 7: Figure S5. Spearman correlation plot as in Figure 1 
using data mapped by TopHat. A) Correlations across 33,438 genes with 
at least one instance of one read mapped by TopHat. B) Correlations 
across 29,156 genes with at least one instance of one read mapped by 
BWA. 

Additional file 8: Figure S6. A) PCA plot of the 15 samples included in 
the study based on data from 29,156 genes with at least one mapped 
read in a single individual, after correcting for the effects of RIN on the 
data. Different colors identify different time-points, while each shape 
indicates a particular individual in the data set. B) Spearman correlation 
plot of the 15 samples in the study, after correcting for the effects of RIN 
on the data. 

Additional file 9: Figure S7. Mean RPKM as a function of time (h) to 
sample collection. 

Additional file 10: Figure S8. Effects of sequencing depth on library 
complexity. Dashed red lines indicate median RPKM in each subset. (A to 
D) Density plots of RPKM values in the 0-hour data when subsampled to 
indicated depths. For comparison, the observed distribution of RPKM 
values in the 84-hour data is plotted in each figure in blue. 

Additional file 11: Table S3. Estimated decay rates for 11,923 
tested genes. 

Additional file 12: Table S4. Significantly overrepresented GO terms 
among slowly degraded genes. 

Additional file 13: Table S5. Significantly overrepresented GO terms 
among rapidly degraded genes. 
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